In this project we will build some simple data visualizations using the BMI Analysis dataset.
The dataset consists of 741 distinct records, each of
which is described by the following features:
* Age (in years): This field quantifies the age of each
individual, denominated in years. It serves as a chronological reference
for the dataset.
* Height (in meters): The “Height” column provides
measurements of the subjects’ stature in meters. This standardized unit
allows for precise representation and comparison of individuals’
heights.
* Weight (in kilograms): In the “Weight” column, the
weights of the subjects are quantified in kilograms. This unit ensures
consistency and accuracy in measuring the subjects’ mass.
* BMI (Body Mass Index): Derived from the height and
weight columns, the BMI column holds the Body Mass Index of each
individual, calculated as BMI = (weight in kg) / (height in m)^2.
BMI is a numerical indicator used for
categorizing individuals based on their weight relative to their height.
It is expressed as a continuous variable.
* BmiClass: The “BmiClass” column categorizes
individuals based on their calculated BMI values. The categories include
“Obese Class 1,” “Overweight,” “Underweight,” among others. These
classifications are instrumental in health and weight analysis.
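As a quick illustration of the BMI formula and the classification step, the sketch below recomputes BMI for four height/weight pairs taken from the dataset and assigns a class to each. The cut-offs (18.5, 25, 30, 35, 40) are the commonly used WHO thresholds; that the dataset's BmiClass column was built with exactly these thresholds is an assumption, and `sample_rows` is a hypothetical name used only here.

```r
# Four height/weight pairs copied from the dataset
sample_rows <- data.frame(
  Height = c(1.85, 1.71, 1.55, 1.46),    # metres
  Weight = c(109.30, 79.02, 74.70, 35.90) # kilograms
)

# BMI = weight in kg divided by the square of height in m
sample_rows$Bmi <- sample_rows$Weight / sample_rows$Height^2

# ASSUMPTION: standard WHO cut-offs; intervals are closed on the left,
# so a BMI of exactly 25 counts as "Overweight"
sample_rows$BmiClass <- cut(
  sample_rows$Bmi,
  breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
  labels = c("Underweight", "Normal Weight", "Overweight",
             "Obese Class 1", "Obese Class 2", "Obese Class 3"),
  right = FALSE
)

print(sample_rows)
```

For these four rows the recomputed values agree with the Bmi and BmiClass columns stored in the dataset.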
First, we load the data:
data <- read.csv("C:/Users/HP/Downloads/dv-bmiclass/bmi.csv")
head(data)
## Age Height Weight Bmi BmiClass
## 1 61 1.85 109.30 31.93572 Obese Class 1
## 2 60 1.71 79.02 27.02370 Overweight
## 3 60 1.55 74.70 31.09261 Obese Class 1
## 4 60 1.46 35.90 16.84181 Underweight
## 5 60 1.58 97.10 38.89601 Obese Class 2
## 6 59 1.71 79.32 27.12630 Overweight
Next, we inspect the data:
dim(data)
## [1] 741 5
The dataset consists of 5 columns (variables) and 741 rows (individuals). Next, let’s look at the data structure:
str(data)
## 'data.frame': 741 obs. of 5 variables:
## $ Age : int 61 60 60 60 60 59 59 59 59 59 ...
## $ Height : num 1.85 1.71 1.55 1.46 1.58 1.71 1.7 1.72 1.46 1.83 ...
## $ Weight : num 109.3 79 74.7 35.9 97.1 ...
## $ Bmi : num 31.9 27 31.1 16.8 38.9 ...
## $ BmiClass: chr "Obese Class 1" "Overweight" "Obese Class 1" "Underweight" ...
As we can see, each variable has an appropriate data type.
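BmiClass is stored as a character vector, which works for grouping, but character categories are ordered alphabetically in tables and plots. If an order from lightest to heaviest class is preferred, one option is to convert the column to an ordered factor. The sketch below does this on a small hypothetical data frame (`toy` and `bmi_levels` are illustrative names; the level labels are the six classes that occur in the dataset).

```r
# Hypothetical stand-in for the real data frame
toy <- data.frame(
  BmiClass = c("Overweight", "Underweight", "Obese Class 1", "Normal Weight"),
  stringsAsFactors = FALSE
)

# Desired order, from lightest to heaviest class
bmi_levels <- c("Underweight", "Normal Weight", "Overweight",
                "Obese Class 1", "Obese Class 2", "Obese Class 3")

# Convert the character column to an ordered factor
toy$BmiClass <- factor(toy$BmiClass, levels = bmi_levels, ordered = TRUE)

# Sorting now follows the level order rather than the alphabet
sorted_classes <- as.character(sort(toy$BmiClass))
print(sorted_classes)
```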
Now we’ll make sure that the data doesn’t contain any missing values.
anyNA(data)
## [1] FALSE
colSums(is.na(data))
## Age Height Weight Bmi BmiClass
## 0 0 0 0 0
Okay, we’re good to go: there are no missing values in the dataset.
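For comparison, here is a minimal sketch of what the same two checks report when values are missing, using a small hypothetical data frame (`toy` is not part of the dataset). It also shows `complete.cases()`, one base R way to keep only fully observed rows.

```r
# Hypothetical data frame with two missing values
toy <- data.frame(
  Age    = c(25, NA, 40),
  Height = c(1.70, 1.65, NA),
  Weight = c(68, 72, 80)
)

# TRUE as soon as any cell is NA
has_na <- anyNA(toy)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Keep only the rows without any missing value
complete_rows <- toy[complete.cases(toy), ]

print(has_na)
print(na_per_column)
print(complete_rows)
```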
Before going deeper into the analysis, we’ll load the necessary packages.
library(ggplot2)
library(reshape2)
library(gcookbook)
library(dplyr)
library(magrittr)
library(plotly)
cor(data$Height, data$Weight)
## [1] 0.6076716
The result shows a moderately strong positive correlation (r ≈ 0.61) between Height and Weight. The following scatter plot shows this relationship visually, with the points coloured by BmiClass.
ggplotly(
  ggplot(data, aes(Weight, Height)) +
    geom_point(aes(colour = BmiClass)) +
    labs(title = "Scatter Plot of Weight and Height with the BMI Class Classified")
)
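A correlation coefficient on its own does not say how precisely it has been estimated. One way to complement `cor()` is `cor.test()` from base R, which returns a confidence interval and a p-value for the Pearson correlation. The sketch below uses simulated height and weight values, since the real data file is not bundled here; the variable names and simulation parameters are assumptions chosen only for illustration.

```r
set.seed(42)

# Simulated stand-ins for the Height and Weight columns
n <- 200
height <- rnorm(n, mean = 1.70, sd = 0.10)
weight <- 70 + 60 * (height - 1.70) + rnorm(n, sd = 10)

# Point estimate, as in the analysis above
r <- cor(height, weight)

# Pearson test: adds a 95% confidence interval and a p-value
test <- cor.test(height, weight)

print(r)
print(test$conf.int)
print(test$p.value)
```

On the real data the equivalent call would be `cor.test(data$Height, data$Weight)`.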
bmiclass_df <- data %>%
  group_by(BmiClass) %>%
  summarize(freq = n())
bmiclass_df
## # A tibble: 6 × 2
## BmiClass freq
## <chr> <int>
## 1 Normal Weight 342
## 2 Obese Class 1 20
## 3 Obese Class 2 55
## 4 Obese Class 3 62
## 5 Overweight 166
## 6 Underweight 96
ggplot(bmiclass_df, aes(x = reorder(BmiClass, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "pink") +
  geom_text(aes(label = freq), vjust = -0.5) +
  labs(title = "The Distribution of The Body Mass Index Category", x = "BMI Category", y = "Frequency") +
  # dashed red line: mean frequency across the categories
  geom_hline(yintercept = mean(bmiclass_df$freq), color = "red", linetype = 5) +
  coord_flip()
As the bar plot shows, only the Normal Weight and Overweight categories have frequencies above the mean category frequency (the dashed red line, at 741 / 6 = 123.5).
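The comparison against the mean can also be checked numerically. The sketch below rebuilds the frequency table from the counts printed above (so it runs without the data file) and lists the categories whose frequency exceeds the mean; the `share` column is an extra added here for illustration.

```r
# Frequency table copied from the summarize() output above
bmiclass_df <- data.frame(
  BmiClass = c("Normal Weight", "Obese Class 1", "Obese Class 2",
               "Obese Class 3", "Overweight", "Underweight"),
  freq = c(342, 20, 55, 62, 166, 96),
  stringsAsFactors = FALSE
)

# Mean frequency across the six categories
mean_freq <- mean(bmiclass_df$freq)

# Categories with more individuals than the mean
above_mean <- bmiclass_df$BmiClass[bmiclass_df$freq > mean_freq]

# Percentage share of each category in the sample
bmiclass_df$share <- round(100 * bmiclass_df$freq / sum(bmiclass_df$freq), 1)

print(mean_freq)   # 123.5
print(above_mean)  # "Normal Weight" "Overweight"
print(bmiclass_df)
```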
ggplotly(
  ggplot(data, aes(BmiClass, Age)) +
    geom_boxplot(fill = "pink") +
    # dashed red line: overall mean age
    geom_hline(yintercept = mean(data$Age), color = "red", linetype = 5) +
    labs(title = "Boxplot of Age Distribution from Each BMI Class")
)
In the boxplot above, the dashed red line marks the overall mean age of 31.62 years. None of the boxplots shows outliers, so no BmiClass contains ages that are extreme relative to the rest of its group. Overlaying the individual observations on the boxplots, as in the next plot, makes this easier to see.
ggplot(data, aes(BmiClass, Age)) +
  # refer to the column as Age inside aes(); data$Age triggers a ggplot2 warning
  geom_jitter(aes(col = Age)) +
  geom_boxplot(alpha = 0.5) +
  labs(title = "Scatterplot of Each Category Boxplot's Age")
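The absence of outliers can also be checked without reading it off a plot. Base R's `boxplot.stats()` returns, in its `out` element, the points lying more than 1.5 times the box length beyond the box. The sketch below applies it per group to a hypothetical data frame (`toy`, with made-up classes "A" and "B"). Note that `geom_boxplot()` follows a similar 1.5 × IQR rule, but its quartiles may be computed slightly differently, so borderline points could be flagged differently.

```r
# Hypothetical data: class B contains one extreme age
toy <- data.frame(
  BmiClass = rep(c("A", "B"), each = 8),
  Age = c(20, 22, 25, 27, 30, 31, 33, 35,
          20, 22, 25, 27, 30, 31, 33, 90)
)

# For each class, return the ages flagged as outliers
outliers_by_class <- lapply(
  split(toy$Age, toy$BmiClass),
  function(ages) boxplot.stats(ages)$out
)

print(outliers_by_class)
```

On the real data, the same call with `split(data$Age, data$BmiClass)` would list any flagged ages per class.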
aggregate(Age~BmiClass, data, mean)
## BmiClass Age
## 1 Normal Weight 27.73977
## 2 Obese Class 1 40.90000
## 3 Obese Class 2 33.21818
## 4 Obese Class 3 31.11290
## 5 Overweight 39.18072
## 6 Underweight 29.83333
aggregate(Age~BmiClass, data, sd)
## BmiClass Age
## 1 Normal Weight 8.641413
## 2 Obese Class 1 16.789408
## 3 Obese Class 2 13.571062
## 4 Obese Class 3 9.927780
## 5 Overweight 10.772743
## 6 Underweight 13.680310
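According to the results above, the standard deviation of age in every BmiClass is well below the class mean (the ratio of standard deviation to mean ranges from about 0.27 for Overweight to about 0.46 for Underweight). This suggests that the ages within each BmiClass are reasonably homogeneous, although the spread does differ between classes.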
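The ratio of standard deviation to mean is the coefficient of variation (CV). The sketch below computes it from the two `aggregate()` outputs printed above, so it runs without the data file; `age_summary` is a hypothetical name used only here.

```r
# Means and standard deviations copied from the aggregate() outputs above
age_summary <- data.frame(
  BmiClass = c("Normal Weight", "Obese Class 1", "Obese Class 2",
               "Obese Class 3", "Overweight", "Underweight"),
  mean_age = c(27.73977, 40.90000, 33.21818, 31.11290, 39.18072, 29.83333),
  sd_age   = c(8.641413, 16.789408, 13.571062, 9.927780, 10.772743, 13.680310),
  stringsAsFactors = FALSE
)

# Coefficient of variation: spread relative to the mean
age_summary$cv <- round(age_summary$sd_age / age_summary$mean_age, 2)

print(age_summary[, c("BmiClass", "cv")])
# cv: 0.31, 0.41, 0.41, 0.32, 0.27, 0.46
```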
# correlation matrix of the four numerical columns, reshaped to long format
cor_matrix <- cor(data[, 1:4])
cor_melt <- melt(cor_matrix)
ggplot(cor_melt, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3, vjust = 0.5) +
  scale_fill_gradient2(high = "magenta", midpoint = 0) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_text(angle = 0, hjust = 1))
Based on the heatmap above, the variable pairs can be ranked by correlation, from strongest to weakest:
1. Bmi x Weight
2. Weight x Height
3. Bmi x Height
4. Bmi x Age
5. Weight x Age
6. Height x Age
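A ranking like this can also be produced programmatically rather than read off the heatmap. The sketch below takes the upper triangle of a correlation matrix and sorts the pairs by absolute correlation. It runs on simulated data (the `toy` data frame and its parameters are assumptions), so its ordering will not necessarily match the list above, and sorting by absolute value rather than signed value is a choice made here.

```r
set.seed(1)

# Simulated stand-in with the same four numerical columns
n <- 200
toy <- data.frame(
  Age    = round(runif(n, 18, 60)),
  Height = rnorm(n, mean = 1.70, sd = 0.10)
)
toy$Weight <- 70 + 60 * (toy$Height - 1.70) + rnorm(n, sd = 10)
toy$Bmi    <- toy$Weight / toy$Height^2

cor_matrix <- cor(toy)

# Row/column positions of the upper triangle: each pair appears once
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)

ranked_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Strongest relationship first, regardless of sign
ranked_pairs <- ranked_pairs[order(-abs(ranked_pairs$r)), ]
print(ranked_pairs)
```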
From the visualizations above, we can conclude a few things:
1. Normal Weight is the largest class, with 342 of the 741 individuals
(about 46% of the sample)
2. Within each BmiClass the age distribution is fairly homogeneous, and the
boxplots show no outliers
3. All numerical variables are correlated to some degree, and Bmi is most
strongly correlated with Weight
4. There isn’t much more to see in this dataset because it has few
variables; however, multiple linear regression or logistic
regression would be natural next steps for analyzing the data.
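As a rough starting point for those two methods, the sketch below fits a multiple linear regression with `lm()` and a logistic regression with `glm()` on simulated data. Everything in it is an assumption made for illustration: the `toy` data frame, the simulation parameters, the choice of predictors, and the binary `Overweight` outcome defined as BMI of 25 or more.

```r
set.seed(123)

# Simulated stand-in for the dataset
n <- 150
toy <- data.frame(
  Age    = round(runif(n, 18, 60)),
  Height = rnorm(n, mean = 1.70, sd = 0.10)
)
toy$Weight <- 70 + 60 * (toy$Height - 1.70) + 0.2 * (toy$Age - 40) + rnorm(n, sd = 10)
toy$Bmi    <- toy$Weight / toy$Height^2

# Multiple linear regression: Bmi explained by the other numerical variables
fit_lm <- lm(Bmi ~ Age + Height + Weight, data = toy)
print(coef(fit_lm))

# Logistic regression on a binary outcome (ASSUMED cut-off: BMI >= 25).
# Only Age is used as predictor: Height and Weight determine Bmi exactly,
# so including both would separate the two groups perfectly.
toy$Overweight <- as.integer(toy$Bmi >= 25)
fit_glm <- glm(Overweight ~ Age, data = toy, family = binomial)
print(coef(fit_glm))
```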